Consider the following scenario: a mother is trying to encourage her child to do the dishes at the end of every day. No matter how many times she asks, the child does not do the dishes. In this case, she wants to reinforce a particular behavior (washing the dishes). A great way to do this is to offer some sort of reward, like an allowance. If she were to give her child an allowance upon the successful washing of the dishes, she would find that this encourages her child to do their chores. This is an example of a well-known concept in psychology called positive reinforcement. In essence, positive reinforcement refers to the addition of something (in this example, an allowance) to reinforce a behavior (doing chores). This psychological phenomenon is the logic that underpins reinforcement learning.
Reinforcement learning is a machine learning method that sits outside both supervised learning (which learns from a training set of labeled examples) and unsupervised learning (which identifies hidden structures from unlabeled data). Instead, reinforcement learning involves training an agent to learn how to make decisions to get the desired result through trial-and-error. The agent interacts with the environment, receives feedback in the form of rewards or penalties, and adjusts its actions to maximize the total reward over time.
There are four main elements in reinforcement learning:
Policy: an agent’s way of behaving at any given time
Reward Signal: immediate reward following an action
Value Function: the total amount of reward an agent can expect to accumulate in the future, starting from the current state
Model of Environment: allows for inferences of how the environment will behave, used for planning
A common reinforcement learning problem that has been studied for decades is the k-armed bandit problem. In this problem, our agent is in a room with \(k\) gambling machines (known as “one-armed bandits”). The agent is allowed a fixed number of pulls, \(n\). During each pull, any arm may be pulled. There is no cost to use these bandits; the only negative consequence of a pull is wasting that turn. When arm \(i\) is pulled, it returns a reward determined by an underlying distribution. These payoffs are independent, and the parameters of each arm’s distribution are unknown. The agent’s goal is to maximize its total reward by concentrating its pulls on the best arms.
In our example, each arm’s reward will be drawn from a Gaussian distribution. The expected reward is referred to as the value of the action \(a\) and is denoted \(q_*(a)\). The action selected on time step \(t\) is denoted as \(A_t\) and the corresponding reward as \(R_t\):
\[ q_*(a) = E[R_t|A_t = a] \]
In the Gaussian example, \(R_t \sim N(\mu_a, \sigma^2)\), so \(q_*(a) = \mu_a\).
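As a concrete sketch of this setup, a 10-armed Gaussian bandit can be simulated in a few lines of Python (the arm count, seed, and function names below are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# A k-armed Gaussian bandit: the true value of arm a is q_*(a) = mu[a],
# and each pull returns a reward drawn from N(mu[a], 1).
k = 10
mu = rng.normal(0.0, 1.0, size=k)  # true action values, unknown to the agent

def pull(arm):
    """Return a noisy reward R_t for the chosen arm."""
    return rng.normal(mu[arm], 1.0)
```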
Since the agent does not know the true parameters underlying the reward distributions, it needs to update its guesses as it tries various actions. The estimated value of action \(a\) at time step \(t\) is denoted as \(Q_t(a)\). One way to update these estimates is by averaging over the rewards already received for a given action \(a\):
\[Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1}R_i \cdot \textbf{1}_{A_i=a}}{\sum_{i=1}^{t-1}\textbf{1}_{A_i=a}}\]
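This sample average need not be recomputed from scratch at every step; it can be maintained incrementally as \(Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)\). A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

k = 10
Q = np.zeros(k)              # Q_t(a): current value estimate for each action
N = np.zeros(k, dtype=int)   # number of times each action has been taken

def update(action, reward):
    """Incremental sample-average update: Q <- Q + (R - Q) / N."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```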
At time step \(t\), the agent has estimated values \(Q_t(a)\) for each possible action \(a\). How does it decide which action to take next? The most obvious choice is to select the action with the highest estimated value. This strategy is referred to as greedy action selection, and it exploits the knowledge the agent already has. But what if the agent estimates a high value for a suboptimal action early on? It will keep selecting that action again and again, without ever exploring other, better actions.
The problem above illustrates one of the most fundamental trade-offs in reinforcement learning. Agents need to balance exploitation (taking actions already known to yield a reward) and exploration (trying new actions that could potentially yield a higher reward). An agent that maximizes exploitation will quickly settle on a solution, but that solution will likely be suboptimal, since the agent sticks with the first actions that happened to pay off. On the other hand, an agent that maximizes exploration will continuously discover new actions but never converge on a solution, since it ignores which actions actually produce the reward.
One way to add exploration is by using the \(\epsilon\)-greedy algorithm. Agents developed using this algorithm will behave greedily with probability \(1-\epsilon\), but will sample from all actions randomly with probability \(\epsilon\). In other words, it will usually exploit its current knowledge, but will explore other actions every once in a while.
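The \(\epsilon\)-greedy rule is easy to state in code; here is a sketch (the helper name and random generator are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(Q, epsilon):
    """With probability epsilon, explore a uniformly random arm;
    otherwise exploit the arm with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```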
Randomized policies, on the other hand, sit at the opposite extreme from a greedy policy: with probability \(p\), they select an action uniformly at random. As such, they favor exploring the environment over exploiting what they currently believe to be the optimal action. A few tweaks can make these randomized policies perform better.
One example is to start off with a large value of \(p\) to encourage initial exploration and slowly decrease the value over time. Another example is to perform as a purely exploratory algorithm for a certain number of iterations, before switching to a purely greedy algorithm. This strategy is known as the \(\epsilon\)-first policy.
This policy takes two parameters: \(\epsilon\) and \(N\). For the first \(\epsilon \times N\) iterations it follows a purely random policy, then switches to a purely greedy policy. To determine how different values of \(\epsilon\) affect performance, we applied this policy to the same bandit with four different values of \(\epsilon\).
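The \(\epsilon\)-first rule can be sketched similarly, assuming the agent tracks the current time step \(t\) (the function name and signature are our own):

```python
import numpy as np

def epsilon_first(Q, t, epsilon, n_total, rng):
    """Explore uniformly at random for the first epsilon * n_total steps,
    then act purely greedily for the remainder of the run."""
    if t < epsilon * n_total:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```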
To compare the performance of the greedy, \(\epsilon\)-greedy, and \(\epsilon\)-first algorithms, we performed simulations of several 10-armed bandit problems.
Simulation 1: Reward Variance 1
Our first simulation used the following parameters to set the reward distributions:
\(R_t \sim N(\mu_a, 1)\) where \(\mu_a \sim N(0, 1)\).
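A single run of this experiment can be sketched as follows; the function below combines the bandit, the sample-average updates, and \(\epsilon\)-greedy selection (the structure and names are our own, and the reported results average many such runs):

```python
import numpy as np

def run_bandit(epsilon, n_steps=1000, k=10, seed=0):
    """One epsilon-greedy run on a k-armed Gaussian bandit.
    Returns the fraction of steps on which the optimal arm was pulled."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 1.0, size=k)   # mu_a ~ N(0, 1)
    best = int(np.argmax(mu))
    Q = np.zeros(k)
    N = np.zeros(k, dtype=int)
    optimal = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:      # explore
            a = int(rng.integers(k))
        else:                           # exploit
            a = int(np.argmax(Q))
        r = rng.normal(mu[a], 1.0)      # R_t ~ N(mu_a, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]       # incremental sample average
        optimal += (a == best)
    return optimal / n_steps
```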
We ran 2,000 simulations over 1,000 time steps, comparing four \(\epsilon\) values: 0, 0.01, 0.1, and 0.5. At time step 1,000, averaged over the 2,000 simulations, the greedy algorithm (\(\epsilon = 0\)) picked the optimal action only 38% of the time. This makes sense, as this agent was not able to explore actions beyond those it had already estimated as best. The agent with \(\epsilon = 0.01\) gradually increased the rate at which it chose the optimal action, outperforming the greedy agent. The agent with \(\epsilon = 0.1\) performed even better, choosing the optimal action more often and accruing a higher reward over time. However, the agent with \(\epsilon = 0.5\) produced the lowest average rewards. This agent chose the action with the highest estimated reward only about half the time and chose randomly otherwise, so it was not able to “learn” as well as the other agents.
For the \(\epsilon\)-first algorithm, we notice the opposite. Unsurprisingly, the agent that performs best has \(\epsilon = 0.5\). As this agent spends 50% of the time exploring the environment prior to exploiting it, it is expected that it would outperform the other agents, which spend significantly less time exploring the environment. Though the agent with \(\epsilon = 0.01\) spends more time exploiting the environment, it is not able to accrue a higher reward, since it hasn’t spent enough time exploring the environment to determine which action results in the highest reward.
Simulation 2: Reward Variance 10
Our next simulation increased the variance of possible rewards:
\(R_t \sim N(\mu_a, 10)\) where \(\mu_a \sim N(0, 1)\).
We expected that noisier rewards would require more exploration to identify the optimal action, and thus that higher \(\epsilon\) values would fare better. Although the agent with \(\epsilon = 0.1\) still performed best, the higher reward variance caused the \(\epsilon = 0.5\) agent to outperform the \(\epsilon = 0.01\) agent, showing that sufficient exploration matters in this situation.
For the \(\epsilon\)-first policy, the agent with \(\epsilon = 0.5\) again performed best; this is unchanged from the previous simulation with a lower reward variance.
Simulation 3: Reward Variance 0
Our next simulation decreased the variance of possible rewards to zero:
\(R_t \sim N(\mu_a, 0)\) where \(\mu_a \sim N(0, 1)\).
In this case, the true value of an action (\(\mu_a\)) is known by the agent after trying it just once. Exploration isn’t needed once each action has been tried, so we expected the greedier algorithms to perform better. However, our results do not show this: the greedy algorithm still performed worst, and the \(\epsilon = 0.01\) algorithm did not do much better than in our first simulation. Again, we find that \(\epsilon = 0.1\) appears to strike a good balance between exploitation and exploration.
Though in our previous simulations the \(\epsilon\)-first agent with \(\epsilon = 0.5\) performed best, here the agent with \(\epsilon = 0.1\) performs just as well. Since the true value of an action is known after trying it once, we don’t need as much exploration as \(\epsilon = 0.5\) provides. For this policy, the agent with the best balance between exploration and exploitation is \(\epsilon = 0.1\).
Regret
Regret can be defined as the expected decrease in reward gained due to exploring, rather than behaving optimally from the very beginning. In other words, it penalizes mistakes whenever they occur during the run. In our application, regret is calculated by the following equation:
\[ \text{regret} = | \text{reward} - \text{optimal reward} | \]

In our application, we know which arm is optimal and what reward pulling it would yield, giving us the ability to calculate regret. In other reinforcement learning applications, however, regret is a hard metric to obtain, since the optimal behavior is typically unknown.
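Given the per-step rewards and the corresponding optimal rewards, this regret accumulates over a run; a minimal sketch (the function name is ours):

```python
def cumulative_regret(rewards, optimal_rewards):
    """Sum of per-step regrets |reward - optimal reward| over a run."""
    return sum(abs(r - opt) for r, opt in zip(rewards, optimal_rewards))
```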
Reward
Rewards are variables associated with states or state-action pairs. In our k-armed bandit experiments, there are 10 arms, each with its own associated expected reward, which may be a positive or negative numeric value. The optimal arm is the arm with the highest expected reward.